Python, an Introspective

Have you ever wondered what Python is made from? Did it crawl out of the primordial-ooze of assembly code directly into the beautiful high-level language that we love to use? Nope! In fact, the language itself is abstract, based on a set of rules and regulations, with a set of core libraries and concepts defined for each specific language version (e.g. 2.7.6 or 3.4.1). In order to actually be able to use the language, the Python language must be implemented according to this set of rules and regulations.

What we usually think of as Python is actually a C-based implementation of the Python language, or CPython. There are other implementations of Python, such as Jython (Java-based), IronPython (.NET-based), and PyPy (written in RPython), but CPython is by far the most commonly used.

The set of tools, rules, and regulations for implementing Python in C is contained in the Python/C API. One of the great things about Python as an open-source language is that the full API is freely available. In other words, you have access to the same exact stuff that the Python core developers do! In theory, there's nothing stopping you from implementing your own version of Python! (Keep in mind, however, that there have been dozens of core developers, plus thousands of brilliant contributors to CPython over 23 years... so you might want to hold off on starting over from scratch.)

This is where the idea of extending python comes from. Using the C/API (or higher-level toolkits that handle some of the trickier parts of the API for you), anybody can build their own C-extension that can then be used just like any other library in python. In this demo, we'll very briefly touch on a couple of tools that allow us to extend Python in this manner.

Why Extend Python?

Extending python is a great solution when you don't want to give up the convenience, flexibility, and feature-richness of the Python environment, but need to leverage some C capabilities as well. Some of the more common use-cases for extending python include:

  • To speed up your python program
    • Computation-bound code that doesn't necessarily lend itself to vectorization (loop+many conditionals)
    • Can achieve speedups on the order of 100-1000x over pure python, depending on the application
  • To interface with existing C-code
    • Have massive legacy C-code that you'd like to be able to utilize in python
    • Often comes in handy when interfacing with hardware
  • To contribute to Python!
    • High-performance user-defined types for scientific applications

The Real Question: How?

There are clearly many scenarios where extending python would be advantageous. In fact, the demand for extending python is so prevalent that there are many different ways to go about it; so many that I can't even list them all here. Some of the more common approaches (including the 3 that we'll talk about today) are listed below.

  • Python/C API - You can use the API itself to create C-extensions. This is the lowest-level approach, thus it gives the user the greatest flexibility, but also has the steepest learning-curve
  • Cython - The cython website describes the project as an "optimizing static compiler ... that makes writing C extensions for python as easy as python itself". It allows you to write code with pythonic syntax (although you'll have to heavily modify it to truly unlock the full potential of the C-extension) that will then be converted to C for you and compiled, producing a C-extension based on your cython code.
  • Boost.python - A pure C++ library that converts your C/C++ into python modules via C++ macro-ninja-wizardry

Other options that we won't be discussing in any detail today include:

  • SWIG
  • ctypes
  • Pyrex
  • psyco

It is important to note that all of the extension options (cython, boost, swig, etc.) are based on the same C/API, but much effort has been put into simplifying the process of creating the C-extension. Many of these tools remove various parts of the "wrapping" process (e.g. reference counting, other nasty C things) from the user and attempt to automate them instead.

The result

We've talked a lot about extending python without actually mentioning what the end product of the extension process is. In fact, the term "extension" might be a bit misleading, because you are not necessarily adding any utility to the core python interpreter or anything like that (although it is possible for you to do so within the API). Most of the common use cases listed above involve generating a python module full of functions and/or user-defined types/objects that can be imported and used within the python environment.

In other words, the extension process compiles your C-code and converts it into a dynamic library (.pyd on windows systems or .so on *nix systems) that can be imported just like any other python module. The flowchart below attempts to summarize the process.
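As a small illustration of the file naming, the sketch below asks the interpreter which filename suffixes it will accept for compiled extension modules. It assumes Python 3 (the `importlib.machinery` module does not exist in Python 2), and `looks_like_extension` is a hypothetical helper written for this example, not part of any library.

```python
import importlib.machinery

# Suffixes this interpreter accepts for compiled extension modules.
# The exact strings depend on the platform and Python build
# (they end in .so on *nix systems and .pyd on windows systems).
suffixes = importlib.machinery.EXTENSION_SUFFIXES
print(suffixes)


def looks_like_extension(filename):
    """Hypothetical helper: True if filename ends with an extension-module suffix."""
    return any(filename.endswith(suffix) for suffix in suffixes)


print(looks_like_extension("c_version" + suffixes[0]))  # a compiled module name
print(looks_like_extension("python_version.py"))        # a plain python source file
```

Any file whose name ends in one of these suffixes and that exports the right initialization function can be picked up by a normal `import` statement.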


In [1]:
from IPython.display import Image
Image('./PythonExtensionFlowchart.png')


Out[1]:

A Closer Look...

Let's take a closer look at the three extension tools we'll be using today.

Full-Blown C/API

As indicated above, you can learn the C/API yourself and write any C-code you want to produce Python modules based on your C-code. The main advantage here is that you have full access to the API without any abstraction layers or "middle-men" in between you and the C-code. This direct access is also a double-edged sword, however, in that you are responsible for all of the things required by the C/API. One of the worst things about the C/API is dealing with how python does memory management via reference counting. This can be a huge pain to get correct and very difficult to debug (in fact, I'm not confident that I've done it correctly in some of my C-extensions).
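To get a feel for what reference counting means, you can watch the counts from the python side. The sketch below is only an illustration in pure python (in a C-extension you would be managing these counts by hand with the API's increment/decrement macros); `refcount_deltas` is a hypothetical helper name, and the behavior shown is specific to CPython.

```python
import sys


def refcount_deltas():
    """Return how an object's reference count changes as names are bound and unbound."""
    obj = []                            # fresh list, referenced by the name `obj`
    base = sys.getrefcount(obj)         # absolute value includes temporary references
    alias = obj                         # binding a second name adds one reference
    with_alias = sys.getrefcount(obj)
    del alias                           # unbinding the name removes that reference
    after_del = sys.getrefcount(obj)
    return with_alias - base, after_del - base


added, remaining = refcount_deltas()
print(added, remaining)
```

On CPython the first delta should be 1 and the second 0. The absolute numbers reported by `sys.getrefcount` vary between versions, which is why only the differences are compared here. When a count reaches zero the object is freed, so a missed increment or decrement in C code shows up as a crash or a memory leak rather than as a friendly exception.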

Because you have to be aware of the API rules and get familiarized with the python implementation, this method probably also has the steepest learning curve. Despite these drawbacks, this is still an excellent way to extend python, and you'll learn an awful lot in the process too. I've used the C/API to write analysis code that is relatively general and often reused on different data sets, as well as to wrap legacy C-code for interfacing with hardware.

Cython

Cython is a superset of the Python language, meaning that all python syntax is valid cython syntax, but there are additional cythonic things that you can introduce to the code to give the cython compiler additional information that will be used to auto-generate efficient C-code from your cython code.

Cython handles the wrapping step from the flowchart above by auto-generating C-code from your cython input file. This makes cython relatively easy to start using (all you have to do is rename any .py file to .pyx and use distutils to build the cython extension). It's not all cake and rainbows however; because all of the wrapping is being autogenerated for you, you don't necessarily know what is going on in that step. As a result, you will likely not get the behavior or performance gains that you were expecting the first time you use cython, due to inefficient switching between C and python code. Fortunately, there are tools to help you debug this and learn as you go.

To be a truly proficient cython programmer probably involves learning the same amount of new rules and syntax as does learning the C/API, but for simple C-extensions cython is probably your best bet to get a quick start.

Boost.python

Boost is a large collection of C++ libraries (just google boost), and frankly I am unfamiliar with most of them. What I do know is that boost.python is another toolkit that does away with the wrapping step (almost) entirely, directly exposing your C/C++ functions and classes to python. Unlike cython, there is no conversion from a 3rd language to C; boost.python converts your valid C/C++ code directly into a python module without any visible intermediate steps.

The main advantage here is that if you're a C++ programmer, you are one or two macro lines away from being the author of C-extensions. The ease with which C++ classes can be converted into python classes is particularly intriguing. I've personally never used boost.python (I was only introduced to it this past summer) but its advantages definitely seem to make it worth learning about.

The tutorial

To introduce these three toolkits, I am going to solve a simple (but non-trivial) example problem using each of the three extension methods outlined above. I think an example that is simple enough to understand but more complex than the usual hello world will allow us to expose some of the true functionality and quirks of these three extension toolkits.

The example problem


In [2]:
Image('DSSD_cartoon.png')


Out[2]:

I decided to steal a real problem from my research to demonstrate C-extensions. I work on segmented-readout semiconductor detectors for detecting and imaging gamma-rays. The segmentation of the readouts gives us information about the 3D position of gamma-ray interactions within the detector, but also introduces an additional complication: every gamma-ray interaction within the sensitive volume of the detector will produce signals on multiple readout strips (see illustration). Ultimately, we want to reconstruct the location of the gamma-ray interaction within the detector from the strip responses. These particular detectors have a full charge-collection time of up to 250 ns, which means that any strips that collect charge within 250 ns of each other are likely to have resulted from a single gamma-ray interacting within the detector.

Our challenge is to find these readouts that correspond to potential gamma-ray interactions from a time-sorted list of strip readout timestamps.
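To make the problem concrete, here is a minimal pure-python sketch of one way to group a time-sorted list of timestamps into candidate events. It is not the implementation benchmarked in this notebook: the function name `find_event_groups`, its parameters (`window`, `min_len`, `max_len`), and the grouping rule (every readout within `window` ticks of the first readout in the group) are all assumptions made for illustration.

```python
def find_event_groups(ts, window, min_len, max_len):
    """Group time-sorted timestamps into candidate events (illustrative sketch).

    A group starts at index i and extends over every following timestamp that
    lies within `window` ticks of ts[i]. Groups with fewer than `min_len` or
    more than `max_len` readouts are discarded.

    Returns (ptrs, lens): the start index and the length of each kept group.
    """
    ptrs, lens = [], []
    i, n = 0, len(ts)
    while i < n:
        j = i + 1
        # ts is sorted, so ts[j] - ts[i] is never negative
        while j < n and ts[j] - ts[i] <= window:
            j += 1
        length = j - i
        if min_len <= length <= max_len:
            ptrs.append(i)
            lens.append(length)
        i = j  # resume the scan at the first timestamp outside the window
    return ptrs, lens


# Hypothetical toy data: two tight clusters and two isolated readouts
toy_ts = [0, 3, 7, 100, 400, 410, 420, 1000]
toy_ptrs, toy_lens = find_event_groups(toy_ts, window=25, min_len=2, max_len=18)
print(toy_ptrs, toy_lens)  # -> [0, 4] [3, 3]
```

Note the shape of the code: an outer loop, an inner loop, and a conditional per element. This is exactly the kind of loop-heavy, branch-heavy logic that does not vectorize easily and that tends to benefit from a C-extension.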


In [3]:
Image('some_data.png')


Out[3]:

In [1]:
%pylab


Using matplotlib backend: Qt4Agg
Populating the interactive namespace from numpy and matplotlib

Load some actual timestamp data


In [4]:
ts = np.loadtxt('timestamp_data.txt', dtype=np.uint64)

In [3]:
print(ts.shape)
ts


(1000000,)
Out[3]:
array([     574428,      574431,      574435, ..., 22850166453,
       22850166455, 22850166456], dtype=uint64)

Pure python code (no extensions)


In [4]:
from python_version import find_potential_event_ptrs as fpep_py

In [5]:
ptrs, lens = fpep_py(ts, 25, 2, 18)

In [8]:
%%timeit
ptrs, lens = fpep_py(ts, 25, 2, 18)


1 loops, best of 3: 4.33 s per loop
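If you want a comparable "best of 3" number outside of IPython, the standard-library `timeit` module can produce one. The sketch below times a stand-in function, `slow_sum`, which is hypothetical and exists only so the example runs on its own; substitute the call you actually want to measure.

```python
import timeit


def slow_sum(n):
    """Stand-in workload: a python loop with a conditional on every pass."""
    total = 0
    for i in range(n):
        if i % 3 == 0:
            total += i
    return total


# Run the call once per measurement, repeat the measurement 3 times,
# and keep the fastest run (the least disturbed by other processes).
runs = timeit.repeat(lambda: slow_sum(100000), number=1, repeat=3)
best = min(runs)
print("1 loop, best of 3: %.3g s per loop" % best)
```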

Cython version


In [5]:
from cython_version import find_potential_event_ptrs as fpep_cy

In [6]:
ptrs, lens = fpep_cy(ts, 25, 2, 18)

In [7]:
%%timeit
ptrs, lens = fpep_cy(ts, 25, 2, 18)


100 loops, best of 3: 9.84 ms per loop

C/API version


In [8]:
from c_code.c_version import find_potential_event_ptrs as fpep_c

In [9]:
ptrs, lens = fpep_c(ts, 25, 2, 18)

In [10]:
%%timeit
ptrs, lens = fpep_c(ts, 25, 2, 18)


100 loops, best of 3: 14.8 ms per loop
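Putting the three timings above side by side, the arithmetic for the speedups is just the pure python time divided by each extension's time. The numbers below are copied from the `%%timeit` outputs in this notebook; the variable names are mine.

```python
# Best-of-3 timings reported above, converted to seconds
timings_s = {
    "pure python": 4.33,
    "cython": 9.84e-3,
    "C/API": 14.8e-3,
}

baseline = timings_s["pure python"]
speedups = {name: baseline / t for name, t in timings_s.items()}

for name in ("pure python", "cython", "C/API"):
    print("%-12s %8.4f s   %6.0fx" % (name, timings_s[name], speedups[name]))
```

That works out to roughly a 440x speedup for the cython version and roughly 290x for the C/API version on this data set, which sits inside the 100-1000x range quoted earlier.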

In [20]:
ptrs, lens = fpep_c(ts, 25, 2, 18)

In [21]:
len(ptrs)


Out[21]:
283737

In [13]:
ptrs


Out[13]:
[0,
 3,
 9,
 11,
 14,
 17,
 19,
 22,
 25,
 28,
 32,
 39,
 41,
 44,
 49,
 51,
 53,
 60,
 63,
 65,
 69,
 72,
 74,
 82,
 87,
 92,
 96,
 98,
 100,
 104,
 111,
 113,
 118,
 120,
 127,
 131,
 134,
 141,
 143,
 146,
 148,
 151,
 154,
 162,
 169,
 172,
 175,
 182,
 188,
 192,
 197,
 199,
 202,
 204,
 207,
 209,
 212,
 217,
 220,
 222,
 227,
 232,
 234,
 241,
 245,
 248,
 253,
 260,
 263,
 266,
 268,
 271,
 276,
 280,
 285,
 288,
 291,
 295,
 302,
 304,
 307,
 313,
 315,
 318,
 320,
 323,
 325,
 329,
 331,
 333,
 335,
 341,
 347,
 355,
 357,
 359,
 364,
 366,
 369,
 374,
 377,
 380,
 383,
 388,
 391,
 393,
 396,
 402,
 404,
 406,
 411,
 419,
 421,
 423,
 426,
 429,
 435,
 439,
 442,
 446,
 451,
 454,
 456,
 460,
 464,
 466,
 468,
 471,
 473,
 477,
 480,
 483,
 485,
 487,
 489,
 493,
 495,
 499,
 501,
 503,
 506,
 511,
 515,
 518,
 522,
 524,
 528,
 535,
 537,
 543,
 545,
 547,
 551,
 555,
 559,
 569,
 572,
 578,
 580,
 584,
 591,
 594,
 596,
 599,
 605,
 609,
 613,
 619,
 621,
 623,
 628,
 630,
 632,
 635,
 643,
 647,
 655,
 658,
 660,
 664,
 670,
 672,
 674,
 678,
 682,
 689,
 691,
 693,
 695,
 697,
 701,
 706,
 710,
 714,
 716,
 721,
 726,
 733,
 736,
 738,
 740,
 743,
 746,
 748,
 750,
 753,
 762,
 765,
 768,
 771,
 774,
 779,
 781,
 783,
 785,
 788,
 790,
 796,
 801,
 803,
 805,
 807,
 809,
 813,
 816,
 821,
 825,
 828,
 836,
 839,
 844,
 846,
 850,
 853,
 857,
 861,
 865,
 873,
 875,
 878,
 881,
 883,
 889,
 891,
 894,
 898]
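Assuming that each entry of `ptrs` is the index in `ts` where a candidate event starts and that the matching entry of `lens` is the number of readouts in it (this is inferred from the names, not something shown above), the timestamps belonging to each event can be pulled out by slicing. `iter_events` and the toy data are hypothetical.

```python
def iter_events(ts, ptrs, lens):
    """Yield the slice of ts belonging to each candidate event."""
    for start, count in zip(ptrs, lens):
        yield ts[start:start + count]


# Hypothetical toy data standing in for the real arrays
toy_ts = [10, 12, 15, 400, 900, 905]
toy_ptrs = [0, 4]
toy_lens = [3, 2]

events = [list(event) for event in iter_events(toy_ts, toy_ptrs, toy_lens)]
print(events)  # -> [[10, 12, 15], [900, 905]]
```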
